Python3, Ventana support, import option #6
@markemus, I tested your script with a couple of slides from my institute. I am testing the results with QuPath and 3DHistech Case Viewer where you can access the label images. In case of a successful anonymization the label image cannot be found and the macro image is cropped so the label is not visible anymore. When I use the script on .ndpi files the anonymization works, although I get the following warnings (full output):
The script did not work with .mrxs files (old format). An error was raised and the slides still contain the label. This is the output (anonymized slide names):
However, in the next few days I will have access to .mrxs files in the new format and will report the results. |
BTW we discovered recently that the Aperio anonymization, at least, does NOT currently remove or modify macro images. Big problem. I've been working on a version that fixes that. It's working now and I'll push soon. I'm not too familiar with the other file formats (other than Ventana), so I don't plan on checking them, but someone more familiar with them should go over them with a fine-toothed comb and make sure that everything that needs to be removed is actually being removed. I don't want to take them on and end up missing a barcode somewhere... |
This is very interesting. I just ran the script (my pull request to you in which I did not touch the Aperio code) on my .svs files and it worked. In which format are your Aperio files stored? And which software do you use for testing? |
This is the current code for Aperio anonymization. It doesn't touch the macro image at all. |
I see where my "mistake" is. I did not check the macro image beforehand. On my slides the macro image does not include the label at all, but contains the main part of the glass slide, including an annotation of where the tissue is (which in turn is used as the thumbnail image). So for my slides there is nothing to anonymize on the macro image. Do the macro images of your Aperio slides contain the label? |
I believe they often do. I have a new version of this that removes the macro, and also removes the filename from the ImageDescription tag which can contain PHI. I'll try to push the new version today. |
… into Tomatenbiss-master fixes for mirax in python3
@Tomatenbiss I merged your Mirax fix as well. Could you please run your tests again and make sure it still works for both formats? For Aperio images the macro will be removed, and the "Filename = (filename)" values in ImageDescription tags should now read "Filename = X". |
@markemus, I tested the script with my files and achieved the following results: 3DHistech (MRXS)
Aperio (SVS)
Hamamatsu (NDPI)
My next step is to check the metadata of my .mrxs files. Do you have any experience with this? |
No, sorry, I've never dealt with mrxs before. But thank you for the test results, much appreciated! |
By the way, I looked into it a bit more: in Aperio, the macro image is cut off at the top so that it does not show the label. However, this cut happens at an arbitrary point, and some of the label (at least on our slides) is still visible. It is hard to tell because the slide is backlit when the macro is captured, which makes the label appear very dark, but it is still quite possible to reveal the data with some basic image manipulation (e.g., histogram equalization in the dark region). Additionally, some of our slides had labels on the lower half of the image, which could be made readable with the same method. |
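For anyone who wants to check their own macro images for this kind of leak, here is a minimal sketch of plain histogram equalization on an 8-bit grayscale crop. It is pure Python so it runs without dependencies; the function name `equalize_histogram` and the list-of-rows input are assumptions for illustration, not part of the script under review. In practice you would load the dark strip of the macro with an imaging library and use a faster implementation.

```python
def equalize_histogram(pixels):
    """Histogram-equalize an 8-bit grayscale image given as a list of rows.

    Dark, low-contrast regions (such as a backlit label) get stretched over
    the full 0..255 range, which can make residual text readable.
    """
    # Count how often each gray level occurs.
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    # Cumulative distribution of gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next((c for c in cdf if c > 0), 0)
    if total == cdf_min:
        # Empty or constant image: nothing to stretch.
        return [list(row) for row in pixels]

    # Lookup table mapping each original level to its equalized level.
    scale = 255 / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[value] for value in row] for row in pixels]
```

If text becomes legible in the equalized crop, the macro should be treated as still containing the label.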
@markemus @Tomatenbiss raising IOError on Aperio label not found and macro not found separately means that if there's no label but there is a macro, the macro won't be removed because the code will exit immediately after not finding the label. I think it would be better to just print the messages and move on rather than raise exceptions in those places. And if you're worried about PHI, you probably should remove every field that isn't explicitly one of the image metrics, not just filename. Date is notably questionable, as remarked in issue 2. Note as well that the "h" in Originalheight appears sometimes lowercase, so probably casing of the tags is inconsistent and they should all be compared without respect to letter casing. |
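To make the allowlist and case-insensitive comparison suggested above concrete, here is a rough sketch. It assumes the usual Aperio layout of a free-text header followed by `|`-separated `Key = Value` fields. The function name `scrub_description` and the contents of `ALLOWED_KEYS` are hypothetical and would need to be adapted to your own files; this only builds the new string and does not write it back into the TIFF.

```python
# Hypothetical allowlist of purely technical keys, stored case-folded.
# Extend or trim it after inspecting the ImageDescription of your own slides.
ALLOWED_KEYS = frozenset({
    "appmag",
    "mpp",
    "originalwidth",
    "originalheight",
    "left",
    "top",
})


def scrub_description(description, allowed=ALLOWED_KEYS):
    """Drop every 'Key = Value' field whose key is not explicitly allowed.

    Keys are compared case-insensitively, so 'OriginalHeight' and
    'Originalheight' are treated the same.
    """
    header, *fields = description.split("|")
    kept = [header]  # the leading header segment is kept unchanged
    for field in fields:
        key, sep, _value = field.partition("=")
        if sep and key.strip().casefold() in allowed:
            kept.append(field)
    return "|".join(kept)
```

Dropping unknown fields rather than listing known-bad ones means a new vendor field carrying PHI is removed by default, at the cost of possibly discarding harmless metadata.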
Also, according to openslide/openslide#297 (comment) it's not safe to rely on "label" and "macro" in the description field. |
@fiendish I'm not maintaining this code anymore, however: The exceptions exist to ensure that we can catch it if PHI is not removed. If the label is misidentified as a thumbnail, for example, we want to know that the anonymization failed. Removing additional fields is probably a good idea but wasn't necessary for our lab. Certainly fields like "project name" could be an issue if they exist. From what I've seen, "label" and "macro" are consistent within openslide formats but inconsistent across formats. The info for particular formats can be found here: https://openslide.org/formats/. Since the code handles each format separately, this shouldn't cause problems. |
That's ok then. Feel free to ignore. I figured I'd tag you here just in case. I had just wanted to add information to this thread because it's likely a place where people will look if they go searching for SVS deidentification scripts.
The problem isn't using exceptions. The problem is where the exceptions are used.
Already-partially-redacted images and images generated by the Aperio GT450 are the same format as the rest. They're all valid Aperio SVS files. The GT450 ones just don't use those strings in the descriptions of the label/macro images, and Leica (allegedly, anyway, because the information is secondhand) says that looking for those strings is the wrong way to check for those frames and that they should be identified by their SUBFILETYPE (254) TIFF tag being either 1 or 9 instead of 0 as mentioned in the openslide bug report. I suppose one could make entirely separate handlers for GT450-generated SVS files and already-partially-redacted SVS files, but this function is called do_aperio_svs not do_aperio_svs_except_for_ones_from_gt450s_or_already_partially_redacted_ones. 🙂 |
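A rough sketch of that approach, combined with the earlier point about where the exceptions go: pick frames by their NewSubfileType value rather than by description strings, try to redact every one of them, and only raise after all have been attempted. The names `find_associated_pages`, `redact_associated_pages` and the `redact_page` callback are hypothetical; the list of tag values would come from whatever TIFF reader you use, and the 1/9 rule is taken from the second-hand guidance above, not verified against a specification.

```python
# NewSubfileType (TIFF tag 254) values that, per the guidance quoted above,
# mark label/macro frames. Which value maps to which frame is not assumed here.
ASSOCIATED_SUBFILE_TYPES = frozenset({1, 9})


def find_associated_pages(subfile_types):
    """Return indices of pages whose NewSubfileType marks a label/macro frame.

    `subfile_types` holds one tag-254 value per TIFF page, in page order.
    """
    return [
        index
        for index, value in enumerate(subfile_types)
        if value in ASSOCIATED_SUBFILE_TYPES
    ]


def redact_associated_pages(subfile_types, redact_page):
    """Redact every associated page, reporting failures only at the end.

    `redact_page` is a caller-supplied function taking a page index. A failure
    on one page does not stop the remaining pages from being processed.
    """
    targets = find_associated_pages(subfile_types)
    failures = []
    for index in targets:
        try:
            redact_page(index)
        except Exception as exc:
            failures.append(f"page {index}: {exc}")
    if failures:
        raise IOError("redaction incomplete: " + "; ".join(failures))
    return targets
```

A file with no matching pages simply returns an empty list here; whether that should count as a failure is a policy choice left to the caller.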
@Tomatenbiss Metadata in MRXS files is stored in the Slidedat.ini file, which is essentially a text file that can be modified with only a few lines of code. |
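As an illustration of the "few lines of code", here is a minimal sketch using Python's standard `configparser`. The key names in `SENSITIVE_KEYS` are assumptions, so inspect your own Slidedat.ini to see which fields actually carry identifying text. The function name `scrub_slidedat` is hypothetical, and this only touches textual metadata in the ini file, not any label or barcode image stored in the data files.

```python
import configparser
import io

# Assumed names of fields that may hold identifying text (case-folded).
SENSITIVE_KEYS = frozenset({
    "slide_name",
    "slide_id",
    "project_name",
    "slide_creationdatetime",
})


def scrub_slidedat(text, sensitive=SENSITIVE_KEYS, placeholder="X"):
    """Return the ini text with the values of sensitive keys overwritten."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep the original key casing on output
    parser.read_string(text.lstrip("\ufeff"))  # tolerate a leading BOM

    for section in parser.sections():
        for key in list(parser[section]):
            if key.casefold() in sensitive:
                parser[section][key] = placeholder

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

Note that `configparser` rewrites the file in its own layout (for example, it drops comments), so compare the result against the original in a viewer before relying on it.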
Just out of curiosity, is there any reason not to merge this? It currently looks as if it has all of the functionality of the python2 version. |
@c-arthurs I haven't been actively maintaining this repo, and haven't taken the time to go through the PRs here. It might happen in the future, but isn't high on my priority list. |
@c-arthurs, @bgilbert: Within the EMPAIA project, we have now developed our own solution for anonymizing WSIs (in various formats). It is currently available via GitLab. The paper is currently in review; the preprint can already be viewed on arXiv. |
So just an update: we have been using this in production for a few years now and it's quite stable. It covers all of the PHI that we've found in our own datasets and the troublesome fields that we've encountered, but we can't guarantee that it will cover every possible use case. There is an update I need to push that adds barcode deletion for Hamamatsu. We also have a couple of extension scripts that save the PHI to separate files as a backup, as well as a basic web GUI, although those may be out of scope. As far as the future goes, I am maintaining it again for now, and I have a coworker who can probably take it over if it comes up again. @bgilbert, would you prefer if I made a fork? There are a lot of improvements in this version: it adds support for a new format and patches some gaps in the old formats that were letting parts of the labels, barcodes, etc. through, as well as moving to Python 3. @Tomatenbiss, very nice to see a new anonymization tool, and the new support for Philips! It sounds quite robust and I hope to play around with it soon when I get a chance. BTW our team has also experimented with a number of standardized formats (OME-TIFF and a custom Aperio-style format of our own) if you want to compare notes some time. |
I'm not 100% certain that it works for the MRXS format: it runs without error on a test image, but I'm not sure how to check that the result is truly de-identified.